On the Performance of Some Subset Selection Procedures*

by 

Shanti  S.  Gupta  and  Jason  C.  Hsu 
Purdue  University  and  The  Ohio  State  University 

1. Introduction and statement of the problem

Research  in  the  area  of  subset  selection  has  progressed  steadily  since 

the 1950's. For many problems, there are heuristically proposed procedures.

When  there  are  competing  procedures  for  a given  problem,  performance 

comparisons  are  often  available.  However,  these  performance  comparisons 

generally  do  not  establish  directly  any  optimality  property  of  the  procedures 

studied.  In  this  paper,  we  restrict  ourselves  to  the  problem  of  selecting  a 

subset  of  normal  populations.  The  approaches  and  results  of  some  previous 

studies  are  discussed  briefly  and  then  the  result  of  a new  Monte  Carlo  study 

is  presented.  We  now  make  the  problem  precise. 

Suppose  n independent  observations  are  obtained  from  each  of  k independent 

normal  populations  having  unknown  (unequal)  means  and  a common  known  variance. 

By  sufficiency  we  can  restrict  our  attention  to  the  sample  means.  For 

i = 1,...,k, let $X_i$ be the sample mean of the n observations from the ith
population and let $X = (X_1,\ldots,X_k)$. Without loss of generality, we assume
that the common known variance of the $X_i$ is 1, so that the joint distribution
of the $X_i$ is $\prod_{i=1}^{k} \Phi(x_i - \theta_i)$, where $\Phi$ is the standard
normal distribution function and $\theta_i$ is the unknown mean of the ith
population. Let $\theta = (\theta_1,\ldots,\theta_k)$ and let
$\theta_{[1]} \le \cdots \le \theta_{[k]}$ be the ordered means. Populations having
larger means are considered better than those having smaller means. The
population associated with $\theta_{[k]}$ is considered the 'best' population. If
more than one $\theta_i$ are tied for $\theta_{[k]}$, then arbitrarily one of them is
chosen to be $\theta_{[k]}$. We are interested in selecting the 'best' population.
However, if the observed values so indicate, we want to select more than one
population, i.e. a subset of populations, to guard against the possibility of
making an error. Thus, the action space $\mathcal{G}$ of the subset selection problem
can be taken as the set of all non-empty subsets of $\{1,\ldots,k\}$, where taking
the action $a \in \mathcal{G}$ means the selection of those populations whose indices
are in a.


2.  Some  proposed  procedures  and  known  results 


For any $a \in \mathcal{G}$, let

$$CS(\theta,a) = \begin{cases} 1 & \text{if } \theta_{[k]} \in \{\theta_i : i \in a\} \\ 0 & \text{otherwise.} \end{cases}$$

Let $ICS(\theta,a) = 1 - CS(\theta,a)$ and $|a|$ = number of elements in a. CS and ICS
stand for 'correct selection' and 'incorrect selection', respectively. For any
subset selection procedure R, if $\delta_R(x,a)$ denotes the probability assigned to
a by R having observed x, then let

$$P_\theta[CS|R] = E_\theta\Big[\sum_{a \in \mathcal{G}} CS(\theta,a)\,\delta_R(X,a)\Big]$$

(the probability of a correct selection) and

$$E_\theta[S|R] = E_\theta\Big[\sum_{a \in \mathcal{G}} |a|\,\delta_R(X,a)\Big]$$

(the expected subset size). For fixed P*, $0 \le P^* \le 1$, a procedure R is said to
satisfy the P* condition if $\inf_{\theta \in \mathbb{R}^k} P_\theta[CS|R] \ge P^*$.
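As an illustration of these definitions (not part of the original study), the two operating characteristics of a non-randomized procedure can be estimated at a fixed $\theta$ by simulation. The sketch below assumes unit variance for the sample means, as in Section 1; the names `correct_selection` and `operating_characteristics`, the simulation size and the seed are hypothetical choices.

```python
import random
from typing import Callable, List, Sequence, Tuple


def correct_selection(theta: Sequence[float], a: Sequence[int]) -> int:
    """CS(theta, a): 1 if the largest mean is among the selected ones, else 0."""
    best = max(theta)
    return int(any(theta[i] == best for i in a))


def operating_characteristics(
    rule: Callable[[List[float]], List[int]],
    theta: Sequence[float],
    n_sim: int = 1000,
    seed: int = 0,
) -> Tuple[float, float]:
    """Monte Carlo estimates of P_theta[CS|R] and E_theta[S|R] for a non-randomized rule."""
    rng = random.Random(seed)
    cs_total = 0
    size_total = 0
    for _ in range(n_sim):
        # Sample means with unit variance, as assumed in Section 1.
        x = [t + rng.gauss(0.0, 1.0) for t in theta]
        a = rule(x)
        cs_total += correct_selection(theta, a)
        size_total += len(a)
    return cs_total / n_sim, size_total / n_sim
```

A rule here is any function mapping the vector of sample means to a list of selected indices; checking the P* condition itself would require the infimum over all $\theta$, which this sketch does not attempt.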

For the normal populations problem, Seal (1955) proposed the following
class C of procedures. For $c = (c_1,\ldots,c_{k-1})$, $c_j \ge 0$ ($j = 1,\ldots,k-1$),
$\sum_{j=1}^{k-1} c_j = 1$, the procedure $R_c(P^*)$ is as follows:

$R_c(P^*)$: Select the ith population iff $X_i \ge \sum_{j=1}^{k-1} c_j X_{[j]} - d_c(P^*)$

where $X_{[1]} \le \cdots \le X_{[k-1]}$ are the ordered sample means excluding $X_i$ and
$d_c(P^*)$ is the smallest number such that the P* condition is satisfied.
When $c = (1/(k-1),\ldots,1/(k-1))$, the procedure $R_c(P^*)$ and its associated constant
$d_c(P^*)$ will be denoted by $R_{avg}(P^*)$ and $d_{avg}(P^*)$ respectively. We will be
interested in the class of procedures $\{R_{avg}(P^*): 0 \le P^* \le 1\}$. Note however,
when P* < 1/2, $R_{avg}(P^*)$ may select an empty set. Hence we modify $R_{avg}(P^*)$
as follows:

$R_{avg}(P^*)$: Select the ith population iff $X_i \ge X_{[k-1]}$ or $X_i \ge \sum_{j=1}^{k-1} X_{[j]}/(k-1) - d_{avg}(P^*)$.

The class of procedures $\{R_{avg}(P^*): 0 \le P^* \le 1\}$ will henceforth be referred
to as 'average type procedures'.
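A minimal sketch of the selection rule in Seal's class, assuming the constant d has already been determined, may make the definition concrete. The function names `seal_select` and `average_type_select` and the `force_nonempty` flag are hypothetical; the flag encodes one reading of the modified average type rule, namely that the population with the largest sample mean is always retained.

```python
from typing import List, Sequence


def seal_select(
    x: Sequence[float], c: Sequence[float], d: float, force_nonempty: bool = False
) -> List[int]:
    """Indices i with x[i] >= sum_j c[j] * (j-th smallest of the other means) - d.

    With force_nonempty=True the population with the largest sample mean is
    always kept, which is how the modified average type rule is read here.
    """
    xs = list(x)
    selected = []
    for i in range(len(xs)):
        others = sorted(xs[:i] + xs[i + 1:])  # ordered sample means excluding x[i]
        threshold = sum(cj * xj for cj, xj in zip(c, others)) - d
        if xs[i] >= threshold or (force_nonempty and xs[i] >= others[-1]):
            selected.append(i)
    return selected


def average_type_select(x: Sequence[float], d: float) -> List[int]:
    """Modified R_avg: equal weights 1/(k-1), never returning an empty subset."""
    k = len(x)
    return seal_select(x, [1.0 / (k - 1)] * (k - 1), d, force_nonempty=True)
```

With weights (0,...,0,1) the same function compares each sample mean with the largest of the others, so the weight vector alone determines which member of the class is being used.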

For the more general problem of selecting a subset to contain the
population having the largest location parameter, Gupta (1965) proposed the
following class of procedures. Let $X_1,\ldots,X_k$ be independent random variables
having the joint distribution $\prod_{i=1}^{k} F(x_i - \theta_i)$. To select a subset to
contain the population associated with $\theta_{[k]}$, where $\theta_{[k]}$ is defined as
before, the procedure $R_{max}(P^*)$ is as follows:

$R_{max}(P^*)$: Select the ith population iff $X_i \ge X_{[k-1]} - d_{max}(P^*)$

where $X_{[k-1]}$ is defined as before and $d_{max}(P^*)$ is the smallest number such
that the P* condition is satisfied. We will be interested in the class of
procedures $\{R_{max}(P^*): 1/k \le P^* \le 1\}$, which will henceforth be referred to
as 'maximum type procedures'. Note that when applied to the normal populations
problem, $R_{max}(P^*)$ is $R_{(0,\ldots,0,1)}(P^*)$ in C.
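The text does not give a formula for $d_{max}(P^*)$. For normal populations with unit variance, a standard fact (assumed here, not stated in this paper) is that the infimum of $P_\theta[CS|R_{max}]$ is attained when all means are equal, where it equals $\int \Phi^{k-1}(x+d)\,d\Phi(x)$. Under that assumption the constant can be found by bisection on a numerically integrated expression. The function names, the integration range and the tolerance below are hypothetical choices.

```python
import math


def _phi(x: float) -> float:
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def _Phi(x: float) -> float:
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def pcs_equal_means(d: float, k: int, lim: float = 8.0, n: int = 4000) -> float:
    """Integral of Phi(x + d)^(k-1) * phi(x) over [-lim, lim] by Simpson's rule (n even)."""
    h = 2.0 * lim / n
    total = 0.0
    for step in range(n + 1):
        x = -lim + step * h
        weight = 1 if step in (0, n) else (4 if step % 2 else 2)
        total += weight * _Phi(x + d) ** (k - 1) * _phi(x)
    return total * h / 3.0


def d_max(p_star: float, k: int, tol: float = 1e-8) -> float:
    """Smallest d >= 0 (up to tol) with pcs_equal_means(d, k) >= p_star, by bisection."""
    lo, hi = 0.0, 20.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if pcs_equal_means(mid, k) >= p_star:
            hi = mid
        else:
            lo = mid
    return hi
```

For k = 2 the integral reduces to $\Phi(d/\sqrt{2})$, which gives a quick sanity check on the numerical constant.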
For the normal populations problem a number of performance comparisons
have been made. Usually attention is restricted to some subset of the parameter
space (e.g. parameter points having the slippage configuration, or sequences
of parameter points having certain limiting behavior), and the operating
characteristics of some competing procedures (e.g. $R_{avg}(P^*)$ and $R_{max}(P^*)$)
are compared. A representative but not exhaustive list of studies of this type
is Seal (1957), Deely and Gupta (1968), Deverman (1969) and Deverman and
Gupta (1969). Generally the results indicate that, in terms of the expected
subset size $E_\theta[S|R]$ and related criteria, $R_{max}(P^*)$ is superior to
$R_{avg}(P^*)$ over much of the parameter space. This, however, does not establish
directly any optimality property of the procedure $R_{max}(P^*)$.

More recently, Berger (1977) proved that $R_{max}(P^*)$ is minimax with respect
to $E_\theta[S|R]$ among all procedures satisfying the P* condition.
In Berger and Gupta (1977), it is proved that $R_{max}(P^*)$ is minimax and
admissible with respect to the maximum of the probability of selecting each
of the non-best populations among all non-randomized 'just' translation
invariant and permutationally invariant procedures satisfying the P* condition.
Hence $R_{max}(P^*)$ is optimal according to the above criteria.

3.  The  decision-theoretic  approach  and  the  loss  functions 

The  approach  taken  in  the  present  study  is  to  compare  the  average 
performance  of  subset  selection  procedures,  where  the  average  is  taken  over 
the  parameter  space  with  respect  to  some  prior.  Thus  the  quantities  to  be 
compared  are  the  integrated  risks.  For  a given  prior,  the  optimal  procedure 
is  the  corresponding  Bayes  procedure  by  definition.  However,  Bayes  procedures 
are  often  difficult  to  use.  Thus,  it  is  reasonable  to  look  for  procedures 
that  are  easy  to  use  and  which  are  approximately  Bayes. 
In classical performance studies of subset selection procedures,
the measures of loss most often used have been $ICS(\theta,a)$ and $|a|$ and
quantities related to $|a|$. More recently, Goel and Rubin (1977) studied
the subset selection problem from a Bayesian point of view using loss
functions that are linear combinations of $\theta_{[k]} - \max_{j \in a} \theta_j$ and $|a|$.
Bickel and Yahav (1977) studied the behavior of Bayes procedures as
$k \to \infty$ using loss functions that are linear combinations of $ICS(\theta,a)$ and
$\theta_{[k]} - \sum_{j \in a} \theta_j/|a|$. Chernoff and Yahav (1977), employing Monte Carlo
techniques, compared the integrated risks with respect to exchangeable normal
priors of Bayes, maximum type and fixed-size procedures of Bechhofer (1954)
using loss functions that are linear combinations of
$\theta_{[k]} - \max_{j \in a} \theta_j$ and $\theta_{[k]} - \sum_{j \in a} \theta_j/|a|$.

The  present  Monte  Carlo  study  parallels  Chernoff  and  Yahav's  in  that 
exchangeable  normal  priors  are  used  but  differs  in  that  the  loss  functions 
considered  are  linear  combinations  of  ICS(0,a)  and  |a|,  and  Bayes,  maximum 
type  and  average  type  procedures  are  compared.  The  four  loss  combinations 
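that have been used are presented in Figure 1.

[Figure 1: the four loss combinations. (1) $ICS(\theta,a)$ with $|a|$;
(2) $\theta_{[k]} - \max_{j \in a} \theta_j$ with $|a|$;
(3) $ICS(\theta,a)$ with $\theta_{[k]} - \sum_{j \in a} \theta_j/|a|$;
(4) $\theta_{[k]} - \max_{j \in a} \theta_j$ with $\theta_{[k]} - \sum_{j \in a} \theta_j/|a|$.]

Figure 1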
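The loss components named in this section are simple functions of $\theta$ and the selected subset a. The sketch below writes them out, together with a linear combination for each of the four pairings. The assignment of the numbers (1) to (4) follows the descriptions in the text (screening-type combinations use $|a|$, the others use the average shortfall); the weights `w1` and `w2` and all function names are hypothetical.

```python
from typing import Callable, Dict, Sequence, Tuple

Component = Callable[[Sequence[float], Sequence[int]], float]


def ics(theta: Sequence[float], a: Sequence[int]) -> float:
    """ICS(theta, a): 1 if the best population is not selected, else 0."""
    best = max(theta)
    return 0.0 if any(theta[i] == best for i in a) else 1.0


def best_shortfall(theta: Sequence[float], a: Sequence[int]) -> float:
    """theta_[k] minus the largest selected mean."""
    return max(theta) - max(theta[i] for i in a)


def avg_shortfall(theta: Sequence[float], a: Sequence[int]) -> float:
    """theta_[k] minus the average of the selected means."""
    return max(theta) - sum(theta[i] for i in a) / len(a)


def size(theta: Sequence[float], a: Sequence[int]) -> float:
    """|a|, the number of selected populations."""
    return float(len(a))


# Pairings as read from the text; the numbering is an inference, not a quotation.
COMBINATIONS: Dict[int, Tuple[Component, Component]] = {
    1: (ics, size),
    2: (best_shortfall, size),
    3: (ics, avg_shortfall),
    4: (best_shortfall, avg_shortfall),
}


def loss(combo: int, theta: Sequence[float], a: Sequence[int], w1: float, w2: float) -> float:
    """Linear combination w1 * first component + w2 * second component."""
    first, second = COMBINATIONS[combo]
    return w1 * first(theta, a) + w2 * second(theta, a)
```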


Note  that  the  different  combinations  have  different  interpretations.  The 
combinations  (1)  and  (2)  correspond  to  situations  where  the  subset  selection 
procedure  is  used  as  a screening  procedure.  For  example,  in  developing  a 
new drug, a pharmaceutical company may start with a number of ingredients
known to have beneficial effects (and side effects) from previous experience,

and  then  obtain  a collection  of  potentially  good  formulations  by  combining 

these  ingredients  in  different  proportions.  After  the  first  stage  of  testing, 

one  wants  to  reject  those  formulations  that  are  evidently  non-best  and  retain 

those  formulations  that  still  seem  potentially  best  for  further  study. 

Eventually,  if  the  development  is  successful,  only  one  formulation  will  be 

marketed.  Corresponding  to  this  situation  then,  loss  functions  that  depend 

only  on  the  best  selected  and  the  size  selected  are  reasonable.  On  the  other 

hand, the component $\theta_{[k]} - \sum_{j \in a} \theta_j/|a|$ in the combinations (3) and (4) corresponds

to  situations  where  all  those  selected  will  be  used.  This  is  the  case,  for 
example,  when  one  purchases  stocks  for  long  term  investment.  One  purchases 
stocks  of  more  than  one  company  to  guard  against  the  possibility  of  gross 
errors,  and  all  the  stocks  purchased  contribute  to  the  gain  or  loss.  We 
believe the distinction between screening-type situations and
non-screening-type situations needs to be pointed out. It is true, however,
that the loss combinations (3) and (4) each contain a component that
corresponds to screening-type situations.

4.  The  Monte  Carlo  result 

Let $N(a,B)$ denote the normal distribution with mean a and
variance-covariance matrix B. Then our model is

$$X \mid \theta \sim N(\theta, I)$$

where $X = (X_1,\ldots,X_k)$, $\theta = (\theta_1,\ldots,\theta_k)$ and I is the identity matrix.

Consider the exchangeable normal prior

$$\theta \sim N(m\mathbf{1},\ rI + sU)$$

where m, r, s are constants, $\mathbf{1} = (1,\ldots,1)$, $U = \mathbf{1}'\mathbf{1}$, r > 0 and -r/k < s < r.
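Then jointly

$$(X,\theta) \sim N\left((m\mathbf{1}, m\mathbf{1}),\ \begin{pmatrix} (1+r)I + sU & rI + sU \\ rI + sU & rI + sU \end{pmatrix}\right).$$

Hence a posteriori

$$\theta \mid X \sim N(\hat\theta, \Sigma)$$

where

$$\hat\theta = m\mathbf{1} + (x - m\mathbf{1})[(1+r)I + sU]^{-1}(rI + sU) = \frac{r}{1+r}\,x + \text{a multiple of } \mathbf{1},$$

$$\Sigma = (rI + sU) - (rI + sU)[(1+r)I + sU]^{-1}(rI + sU) = rI - \frac{r^2}{1+r}\,I + \text{a multiple of } U = \frac{r}{1+r}\,I + \text{a multiple of } U.$$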
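The structure of the posterior mean and covariance can be checked numerically. The sketch below uses only the standard library and relies on the closed-form inverse $(aI + bU)^{-1} = I/a - \frac{b}{a(a+kb)}U$, which is an added identity rather than something stated in the text. The function names and the test values of m, r and s are hypothetical.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def a_i_plus_b_u(a: float, b: float, k: int) -> Matrix:
    """The k x k matrix a*I + b*U, where U is the all-ones matrix."""
    return [[(a if i == j else 0.0) + b for j in range(k)] for i in range(k)]


def matmul(A: Matrix, B: Matrix) -> Matrix:
    """Plain matrix product of two list-of-lists matrices."""
    return [
        [sum(A[i][l] * B[l][j] for l in range(len(B))) for j in range(len(B[0]))]
        for i in range(len(A))
    ]


def posterior(x: Sequence[float], m: float, r: float, s: float) -> Tuple[List[float], Matrix]:
    """Posterior mean and covariance of theta given x under the exchangeable normal prior."""
    k = len(x)
    prior_cov = a_i_plus_b_u(r, s, k)
    # Closed-form inverse of the marginal covariance (1+r)I + sU.
    a, b = 1.0 + r, s
    marg_inv = a_i_plus_b_u(1.0 / a, -b / (a * (a + k * b)), k)
    gain = matmul(marg_inv, prior_cov)  # [(1+r)I + sU]^{-1} (rI + sU)
    # Row-vector convention, as in the text: mean = m1 + (x - m1) * gain.
    mean = [m + sum((x[i] - m) * gain[i][j] for i in range(k)) for j in range(k)]
    reduction = matmul(prior_cov, gain)
    cov = [[prior_cov[i][j] - reduction[i][j] for j in range(k)] for i in range(k)]
    return mean, cov
```

Subtracting $(r/(1+r))x$ from the returned mean should leave the same number in every coordinate, and the returned covariance should have diagonal entries exceeding the common off-diagonal entry by exactly $r/(1+r)$.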

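Consider the loss function $L(\theta,a) = c_1\,ICS(\theta,a) + c_2|a|$ where $c_1, c_2 \ge 0$,
$c_1 + c_2 = 1$. It is easy to see that for this loss function the Bayes procedure,
denoted by $R_B$, is as follows:

$R_B$: Select the ith population iff $X_i \ge X_{[k-1]}$ and/or
$P[\theta_i = \theta_{[k]} \mid X] \ge c_2/c_1$.

If we denote by $\Phi_{a,B}$ the normal distribution function with mean a and
variance-covariance matrix B, then

$$P[\theta_i = \theta_{[k]} \mid x] = \int_{\{\theta_i = \theta_{[k]}\}} d\Phi_{\frac{r}{1+r}x + \text{a multiple of } \mathbf{1},\ \frac{r}{1+r}I + \text{a multiple of } U}(\theta) = \int_{\{\theta_i = \theta_{[k]}\}} d\Phi_{\frac{r}{1+r}x,\ \frac{r}{1+r}I}(\theta). \qquad (4.1)$$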
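Expression (4.1) can be reduced to a one-dimensional integral: with $\tau = r/(1+r)$ and independent posterior coordinates, $P[\theta_i = \theta_{[k]} \mid x] = \int \varphi(z) \prod_{j \ne i} \Phi(z + \sqrt{\tau}(x_i - x_j))\,dz$. This reduction is an added step, not quoted from the paper, and the sketch below evaluates it by Simpson's rule. The function names, the grid size and the integration limits are hypothetical, and the paper's own integration scheme is not described.

```python
import math
from typing import List, Sequence


def _phi(z: float) -> float:
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def _Phi(z: float) -> float:
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_best(x: Sequence[float], r: float, n: int = 2000, lim: float = 8.0) -> List[float]:
    """P[theta_i = theta_[k] | x] for each i, with theta | x ~ N(tau*x, tau*I), tau = r/(1+r)."""
    root = math.sqrt(r / (1.0 + r))
    h = 2.0 * lim / n
    probs = []
    for i in range(len(x)):
        total = 0.0
        for step in range(n + 1):
            z = -lim + step * h
            weight = 1 if step in (0, n) else (4 if step % 2 else 2)
            term = _phi(z)
            for j in range(len(x)):
                if j != i:
                    term *= _Phi(z + root * (x[i] - x[j]))
            total += weight * term
        probs.append(total * h / 3.0)
    return probs


def bayes_select(x: Sequence[float], r: float, c1: float, c2: float) -> List[int]:
    """Keep the population with the largest sample mean, plus any i with probability >= c2/c1."""
    probs = prob_best(x, r)
    top = max(range(len(x)), key=lambda i: x[i])
    return [i for i in range(len(x)) if i == top or probs[i] >= c2 / c1]
```

For two populations the probability reduces to $\Phi(\sqrt{\tau}(x_1 - x_2)/\sqrt{2})$, which is a convenient check on the integration.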
Hence the Bayes procedure is translation invariant and can be obtained by
numerical integration. The following computation shows that the integrated
risk of any translation invariant procedure is independent of m and s so
long as the loss function is translation invariant.

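$$\text{Integrated Risk} = \iint L(\theta, a(x))\, d\Phi_{\frac{r}{1+r}x + b\mathbf{1},\ \frac{r}{1+r}I + gU}(\theta)\, d\Phi_{m\mathbf{1},\ (1+r)I + sU}(x) = \iint L(\theta, a(x))\, d\Phi_{\frac{r}{1+r}x,\ \frac{r}{1+r}I}(\theta)\, d\Phi_{0,\ (1+r)I}(x)$$

where b and g are appropriate constants.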

Since  both  maximum  type  and  average  type  procedures  are  translation 
invariant, we can reduce the set of parameters to just k, r and $c_2/c_1$.

Monte  Carlo  comparisons  of  Bayes,  maximum  type  and  average  type 
procedures  were  carried  out  for  k = 3 and  k = 8.  The  range  of  r was 
$r_i = (1.8)^i$, i = -4,...,4 for both k = 3 and k = 8. The range of $c_1/c_2$
was $c_1/c_2 = 3^{[\text{exponent illegible}]}$, i = 2,...,8 for k = 3 and $c_1/c_2 = 4, 6, 8^{[\text{exponent illegible}]}$, i = 3,...,9
for k = 8. For k = 3, 400 simulations were performed at each $(r, c_1/c_2)$
pair,  while  for  k = 8,  the  number  of  simulations  was  200  each.  For  each 
simulation,  the  random  vector  X is  generated  according  to  its  marginal 
distribution.  By  numerically  integrating  the  expression  (4.1)  for  each  i, 
the  action  taken  by  the  Bayes  procedure  and  the  associated  posterior  risk 
are obtained. The average of these posterior risks then serves as an estimate
of the Bayes risk. The best maximum type and average type procedures and
the regrets incurred by using them are estimated by examining the average
regrets corresponding to two sufficiently fine grids of the constants $d_{max}$
and $d_{avg}$, where the two grids are determined from the result of a preliminary
study. Tables IA and IB give for each $(r, c_1/c_2)$ pair the estimated Bayes
risk, the estimated regrets incurred by the best maximum type procedure and
the best average type procedure, and the estimated standard deviations of
these estimates. Tables IIA and IIB list for each $(r, c_1/c_2)$ pair the
estimated constants $d_{max}$ and $d_{avg}$, and their associated P* corresponding to

the  best  maximum  type  and  the  best  average  type  procedures.  As  can  be  seen 
from Tables IA and IB, both maximum type and average type procedures do almost
as  well  as  the  Bayes  procedures  when  the  prior  is  concentrated  (i.e.  variance 
r is  small),  with  the  average  type  procedures  having  a slight  edge.  However, 
when  the  prior  is  diffuse  (r  large),  the  maximum  type  procedures  continue  to 
do  almost  as  well  as  the  Bayes  procedures,  while  the  average  type  procedures 
can  do  very  badly.  In  this  sense  then,  the  maximum  type  procedures  are  safe 
to use. As by-products, Tables IIIA and IIIB give for each $(r, c_1/c_2)$ the
average  subset  size  and  probability  of  a correct  selection  for  the  Bayes, 
the  best  maximum  type,  and  the  best  average  type  procedure.  They  also  give 
the  proportions  of  times  the  best  maximum  type  and  the  best  average  type 
procedure  coincide  with  the  Bayes  procedure. 
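The grid search over the constants d can be sketched in a few lines. The following is a simplified illustration rather than the study's actual program: it sets m = 0 and s = 0 (which the invariance argument above permits for translation invariant procedures), omits the Bayes procedure and the regrets, and uses the same random numbers at every grid point. The function names, the grid values and the simulation sizes are hypothetical.

```python
import random
from typing import Callable, List, Sequence, Tuple


def select_max(x: Sequence[float], d: float) -> List[int]:
    """Maximum type rule: keep i if x[i] >= (largest of the others) - d."""
    k = len(x)
    return [i for i in range(k) if x[i] >= max(x[j] for j in range(k) if j != i) - d]


def select_avg(x: Sequence[float], d: float) -> List[int]:
    """Modified average type rule: keep the largest, or any i within d of the others' average."""
    k = len(x)
    out = []
    for i in range(k):
        others = [x[j] for j in range(k) if j != i]
        if x[i] >= max(others) or x[i] >= sum(others) / (k - 1) - d:
            out.append(i)
    return out


def integrated_risk(
    rule: Callable[[Sequence[float], float], List[int]],
    d: float, k: int, r: float, c1: float, c2: float, n_sim: int, seed: int = 0,
) -> float:
    """Monte Carlo estimate of E[c1 * ICS + c2 * |a|] with theta ~ N(0, rI), X | theta ~ N(theta, I)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_sim):
        theta = [rng.gauss(0.0, math_sqrt(r)) for _ in range(k)]
        x = [t + rng.gauss(0.0, 1.0) for t in theta]
        a = rule(x, d)
        best = max(theta)
        incorrect = 0 if any(theta[i] == best for i in a) else 1
        total += c1 * incorrect + c2 * len(a)
    return total / n_sim


def math_sqrt(v: float) -> float:
    """Square root helper, kept local so the block has no extra imports."""
    return v ** 0.5


def best_on_grid(rule, grid: Sequence[float], **kw) -> Tuple[float, float]:
    """Smallest estimated risk over the grid, returned as (risk, d)."""
    return min((integrated_risk(rule, d, **kw), d) for d in grid)
```

A call such as `best_on_grid(select_max, [0.0, 0.5, 1.0, 2.0], k=3, r=4.0, c1=0.9, c2=0.1, n_sim=400)` returns the grid point with the smallest estimated risk; comparing it with the corresponding result for `select_avg` mimics, very roughly, one cell of the comparison described above.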

5.  Concluding  remarks 

The  Monte  Carlo  result  of  Chernoff  and  Yahav  (1977)  indicates  that, 
with  respect  to  the  loss  combination  (4)  and  exchangeable  normal  priors, 
maximum  type  procedures  do  almost  as  well  as  the  Bayes  procedures.  The 
result  of  the  present  Monte  Carlo  study  indicates  that,  with  respect  to 
the  loss  combination  (1)  and  exchangeable  normal  priors,  maximum  type 
procedures  do  almost  as  well  as  the  Bayes  procedures.  From  these  results, 
it  seems  reasonable  to  expect  the  maximum  type  procedure  to  do  well  with 
respect  to  the  loss  combinations  (2)  and  (3)  and  exchangeable  normal  priors 
also. One final point worth mentioning is that since the loss function
$c_1\,ICS(\theta,a) + c_2|a|$ depends only on the relative ranking of the $\theta_i$, the
results of this study extend (approximately) to problems that can be
transformed monotonically (approximately) to the normal populations problem.
Thus the result extends to the problem of selecting a subset of log-normal
populations  in  terms  of  the  means.  The  result  also  sheds  light,  for  example, 
on  the  problem  of  selecting  a subset  of  binomial  populations  since  the 
problem  can  be  transformed  approximately  into  the  normal  populations  problem. 
This table gives the estimated Bayes risk, the regret incurred by the best
maximum type procedure, and the regret incurred by the best average type